Host Tree Genera Geographic Distribution Shifts

This R Markdown document summarizes the geographic distribution of host tree genera using occurrence data from AutoArborist, OpenTrees, iNaturalist, and the USFS Forest Inventory and Analysis (FIA) dataset.

Knowing the distribution and relative abundance of host tree genera across the US may provide useful priors for image classification of tree genera.

Analysis Questions

  1. What are the observed geographic distributions of tree genera in the United States?

  2. How do different data sources (AutoArborist, OpenTrees, iNaturalist, FIA) vary in their characterization of tree genera distributions? Do urban tree genera have a different relative abundance than those in natural settings?

  3. How can we characterize geographic distribution shifts in tree genera to potentially improve image classification?

Distribution Shifts

The AutoArborist dataset paper (Beery et al., 2022) describes how distribution shifts in tree genera affect the accuracy of tree genus classification using CNNs.

“One of our primary challenges is to be able to do well on novel cities that were not part of the training set, but in order for a model to do so, it will have to contend with distribution shift, where the training distribution of cities differs from the novel test distribution on some new city.”

Label shift occurs when the marginal distribution of labels (genera) differs from city to city even though the appearance of each genus in an image does not change. In other words, the mix of genera varies geographically (e.g., we tend to see palm trees in Southern California and far fewer in Canada).
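
As a sketch of the usual formalization (the notation here is an assumption and does not come from the source), let \(y\) be the genus label and \(x\) the image. Label shift between a training city and a test city can then be written as:

\[ p_{\text{train}}(y) \neq p_{\text{test}}(y), \qquad p_{\text{train}}(x \mid y) = p_{\text{test}}(x \mid y) \]

The first condition says the genus frequencies differ between cities; the second says a given genus looks the same wherever it grows.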

Figure 4 visualizes the distribution shift between pairs of cities using the L1 norm distance between normalized genus distributions.

Label Shift vs Accuracy

There is a strong correlation between distribution similarity and performance. Notably, models can achieve the same accuracy with significantly less training data if the training and test distributions are similar.

Put simply, tree genus classification accuracy declines when the distribution of genera differs between training and testing.

L1 Norm

The authors use the L1 norm to describe the distribution shifts between cities with different tree genera.

The L1 norm, also known as the Manhattan distance, between two vectors \(\mathbf{u}\) and \(\mathbf{v}\) is calculated as:

\[ ||\mathbf{u} - \mathbf{v}||_1 = \sum_{i=1}^{n} |u_i - v_i| \]

where \(n\) is the dimensionality of the vectors.

## [1] "Example: The distribution of tree genera in city 1 (vector1) and city 2 (vector2) can be compared using the L1 norm."
## [1] "Count of tree genera in city 1:"
##  [1] 1 2 3 4 5 4 3 2 1 2 1
## [1] "Count of tree genera in city 2:"
##  [1] 8 7 7 2 3 1 2 0 2 1 2
## [1] "L1 distance between vector1 and vector2: 29"
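
The example above can be reproduced with a few lines of base R. This is a minimal sketch: the helper name `l1_distance` is hypothetical, and the two vectors are copied from the printed output.

```r
# Genus counts for the two example cities (values copied from the output above)
vector1 <- c(1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 1)
vector2 <- c(8, 7, 7, 2, 3, 1, 2, 0, 2, 1, 2)

# L1 norm of the difference: sum of absolute element-wise differences
l1_distance <- function(u, v) {
  stopifnot(length(u) == length(v))  # both vectors must cover the same genera
  sum(abs(u - v))
}

l1 <- l1_distance(vector1, vector2)
print(paste("L1 distance between vector1 and vector2:", l1))
```

The absolute differences are 7, 5, 4, 2, 2, 3, 1, 2, 1, 1, 1, which sum to 29.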

AutoArborist Observations

## [1] "There are  23 cities sampled in Autoarborist."
## [1] "There are  309  tree genera sampled from Autoarborist."
## [1] "There are  1155612  tree genera records sampled from Autoarborist."

Examine AutoArborist Data

## [1] "Example City: Sioux Falls, SD"

AutoArborist Total Genera

## 
##        acer    fraxinus     quercus       ulmus      prunus       tilia 
##       82255       64204       56558       55309       53582       46406 
##       pyrus   gleditsia       malus    platanus liquidambar       pinus 
##       34918       33466       33443       30024       26319       24467 
##    magnolia       picea      ginkgo     zelkova      celtis   crataegus 
##       23027       21893       20614       18058       17974       17517 
##     populus    carpinus 
##       16468       15952
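
A table of the most frequent genera like the one above can be produced with `table()` and `sort()`. The sketch below uses a small made-up data frame; the name `records` and its `genus` column are assumptions, not the original code.

```r
# Hypothetical records: one row per observed tree
records <- data.frame(
  genus = c("acer", "quercus", "acer", "ulmus", "acer", "quercus")
)

# Count records per genus and sort from most to least frequent
genus_totals <- sort(table(records$genus), decreasing = TRUE)

# Show (up to) the 20 most frequent genera
head(genus_totals, 20)
```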

AutoArborist Genus Counts Per City

## [1] "Abundance of tree genera in cities of North America from AutoArborist"

Normalized Per-Genus Counts AutoArborist

## [1] "We compare tree genera between cities by first normalizing the genus counts within each city. For genera not recorded in a city, we set the count to zero."
## [1] "The first ten cities and ten genera normalized per city."
##           City                abies             abutilon              acacia
## 1  Bloomington 0.000218866272707376                    0                   0
## 2      Boulder  0.00274622394207964                    0                   0
## 3      Buffalo 9.86436498150432e-05                    0                   0
## 4      Calgary 0.000384583688157569                    0                   0
## 5    Cambridge  0.00344717011383678                    0                   0
## 6     Columbus 4.14353194663131e-05                    0                   0
## 7       Denver  0.00106410911955335                    0                   0
## 8     Edmonton 5.96445186687343e-05                    0                   0
## 9    Kitchener  0.00026214037617144                    0                   0
## 10  LosAngeles 2.03194213028813e-05 6.77314043429376e-06 0.00992265073624037
##    acca                acer acrocomia             aesculus afrocarpus
## 1     0   0.226745458524841         0 0.000656598818122127          0
## 2     0   0.138184995631007         0  0.00291266175675113          0
## 3     0   0.242466091245376         0   0.0301849568434032          0
## 4     0   0.027690025547345         0  0.00255473449990385          0
## 5     0    0.36307519640853         0   0.0041686708353375          0
## 6     0   0.099306648987597         0  0.00276235463108754          0
## 7     0  0.0546703334669228         0   0.0110971379610564          0
## 8     0  0.0498926398663963         0  0.00745556483359179          0
## 9     0   0.268169604823383         0   6.553509404286e-05          0
## 10    0 0.00701020034949405         0 4.06388426057626e-05          0
##                 agathis
## 1                     0
## 2                     0
## 3                     0
## 4                     0
## 5                     0
## 6                     0
## 7                     0
## 8                     0
## 9                     0
## 10 1.35462808685875e-05
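
The per-city normalization can be sketched as follows, assuming one row per tree with `city` and `genus` columns (hypothetical names and toy data). `table()` fills in zeros for genera that are absent from a city, and `prop.table(..., margin = 1)` divides each row by its total.

```r
# Hypothetical records: one row per tree, with its city and genus
records <- data.frame(
  city  = c("A", "A", "A", "B", "B", "B", "B"),
  genus = c("acer", "acer", "quercus", "pinus", "pinus", "acer", "ulmus")
)

# City-by-genus count matrix; genera absent from a city get a count of 0
counts <- table(records$city, records$genus)

# Divide each row by its total so every city's proportions sum to 1
normalized <- prop.table(counts, margin = 1)

round(normalized, 3)
```

Normalizing by row makes cities with very different numbers of records comparable, since only the relative frequencies remain.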

L1 and L2 Norm Per City AutoArborist

## [1] "We can calculate and compare the L1 norm distance of tree genera counts between cities."
## [1] "L1 distances of tree genera between cities shown using a dendrogram."
## [1] "Cities with similar distributions of tree genera cluster together, apart from cities with dissimilar distributions."

## [1] "The L1 norm calculated between cities shows patterns of similar and dissimilar tree genera."
## [1] "L1 norm considers the absolute differences between two vectors."

## [1] "L2 norm (Euclidean distance) squares the differences between two vectors, so large differences in individual genera weigh more heavily than in the L1 norm."
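
Pairwise distances and a dendrogram of this kind can be sketched with base R's `dist()` and `hclust()`. The matrix below is toy data, and average linkage is an assumption; the source does not state which linkage method was used.

```r
# Toy normalized genus proportions: one row per city, each row sums to 1
normalized <- rbind(
  CityA = c(acer = 0.50, quercus = 0.30, pinus = 0.20),
  CityB = c(acer = 0.45, quercus = 0.35, pinus = 0.20),
  CityC = c(acer = 0.05, quercus = 0.15, pinus = 0.80)
)

# Pairwise distances between cities (rows)
l1 <- dist(normalized, method = "manhattan")  # L1: sum of absolute differences
l2 <- dist(normalized, method = "euclidean")  # L2: sqrt of summed squared differences

# Hierarchical clustering on the L1 distances (linkage method is an assumption)
tree <- hclust(l1, method = "average")
plot(tree, main = "Cities clustered by L1 distance (toy data)")
```

In this toy example CityA and CityB are merged first because their genus proportions are nearly identical, while CityC joins last.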

OpenTrees Observations

## [1] "There are  293  tree genera sampled from Open Trees."
## [1] "There are  5987025  tree genera records sampled from Open Trees."
## [1] "There are  70  cities sampled by Open Trees."

OpenTrees Total Genera

## [1] "The total count and distribution of tree genera sampled from OpenTrees across 70 cities."
## 
##        acer    fraxinus       ulmus     quercus       picea      prunus 
##     1031057      592574      408195      379096      337959      292677 
##       tilia   gleditsia     populus    platanus       pinus       malus 
##      281159      241142      184923      182368      168242      158180 
##       pyrus     syringa      ginkgo liquidambar      celtis      betula 
##      155733       85833       70740       62395       61908       58684 
##     zelkova    magnolia 
##       55300       52945

OpenTrees Genus Counts Per City

## [1] "Abundance of tree genera in cities of North America from OpenTrees"

Normalized Per-Genus Counts OpenTrees

## [1] "We compare tree genera between cities by first normalizing the genus counts within each city. For genera not recorded in a city, we set the count to zero."
## [1] "The first ten cities and ten genera normalized per city."
##          source                abies abutilon              acacia acca
## 1     auburn_me   0.0194697597348799        0                   0    0
## 2      berkeley 0.000438994410137844        0 0.00798969826450877    0
## 3       boulder  0.00517255152580284        0                   0    0
## 4    bozeman_mt   0.0010688591983556        0                   0    0
## 5    buffalo-ny 0.000891099643560143        0                   0    0
## 6       calgary 0.000131245078309563        0                   0    0
## 7     cambridge  0.00156212856997317        0                   0    0
## 8  champaign_il 0.000420698359276399        0                   0    0
## 9       cornell  0.00696290669705026        0                   0    0
## 10       denver   0.0015654572190444        0                   0    0
##                  acer acrocomia            aesculus afrocarpus agathis
## 1    0.43827671913836         0 0.00248550124275062          0       0
## 2  0.0851356492727326         0  0.0126137727179607          0       0
## 3   0.116012941364435         0  0.0040741332481227          0       0
## 4   0.170030832476876         0  0.0055087358684481          0       0
## 5   0.346065861573655         0  0.0237005905197638          0       0
## 6  0.0130640934298297         0 0.00170826927323559          0       0
## 7   0.194688762862091         0  0.0030902978232078          0       0
## 8   0.331136353012668         0 0.00331884261206937          0       0
## 9   0.120394986707178         0  0.0035447525003165          0       0
## 10  0.145126678908894         0 0.00609297056940428          0       0

L1 and L2 Norm Per City OpenTrees

## [1] "L1 distances of tree genera between cities shown using a dendrogram."
## [1] "Cities with similar distributions of tree genera cluster together, apart from cities with dissimilar distributions."

## [1] "The L1 norm calculated between cities shows patterns of similar and dissimilar tree genera."
## [1] "L1 norm considers the absolute differences between two vectors."

## [1] "L2 norm (Euclidean distance) squares the differences between two vectors, so large differences in individual genera weigh more heavily than in the L1 norm."

iNaturalist Observations

## [1] "There are  278  tree genera selected from iNaturalist"
## [1] "There are  6349840  tree genera records sampled from iNaturalist."

iNat Total Genera

## [1] "The top reported tree genera in iNaturalist"
## 
##        quercus           acer          pinus          rubus        solanum 
##         394006         304046         288224         226971         189134 
##      euphorbia       lonicera           rhus         cornus       viburnum 
##         175256         167183         153505         141628         132228 
##         prunus      juniperus           ilex       veronica        populus 
##         119399         118538         116777         114632         112717 
##           rosa arctostaphylos       sambucus          ribes         pieris 
##         109679          93221          91521          89556          88215

iNat Count of Genera

## [1] "Abundance of tree genera in North America from iNaturalist"

Compare Trees in iNat to OpenTrees in NYC

## [1] "Do urban tree genera have a different relative abundance than those in natural settings?"
## [1] "Compare OpenTrees and iNat records in NYC"
## [1] "Sampled iNat records within 0.2 decimal degrees"
## [1] "There are 76583 tree genera records from iNat in NYC."
## [1] "There are 652169 tree genera records from OpenTrees in NYC."
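
One simple way to sample records "within 0.2 decimal degrees" is a bounding box around a city center. The sketch below assumes a box rather than a radial distance, uses approximate NYC coordinates, and uses hypothetical column names; the source does not specify any of these.

```r
# Approximate NYC center (assumption; the source does not give coordinates)
nyc_lat <- 40.7128
nyc_lon <- -74.0060
half_width_deg <- 0.2  # half-width of the box in decimal degrees

# Hypothetical iNaturalist-style records
inat <- data.frame(
  genus     = c("acer", "quercus", "pinus", "ailanthus"),
  latitude  = c(40.75, 40.60, 42.36, 40.95),
  longitude = c(-73.99, -74.15, -71.06, -74.00)
)

# Keep records whose latitude and longitude both fall inside the box
in_box <- abs(inat$latitude - nyc_lat) <= half_width_deg &
  abs(inat$longitude - nyc_lon) <= half_width_deg
inat_nyc <- inat[in_box, ]

print(paste("Records kept:", nrow(inat_nyc)))
```

A box in decimal degrees is not square on the ground, since a degree of longitude is shorter than a degree of latitude at NYC's latitude, so a projected buffer may be preferable when exact distances matter.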

## [1] "Get counts of tree genera from iNat"
## [1] "Normalize counts of tree genera from iNat"
## # A tibble: 125 x 3
##    genus          count normalized_count
##    <chr>          <int>            <dbl>
##  1 abies              4        0.0000522
##  2 abutilon         234        0.00306  
##  3 acer            4894        0.0639   
##  4 aesculus         835        0.0109   
##  5 ailanthus       5088        0.0664   
##  6 albizia          312        0.00407  
##  7 alnus             61        0.000797 
##  8 amelanchier      203        0.00265  
##  9 aralia           848        0.0111   
## 10 arctostaphylos     3        0.0000392
## # i 115 more rows
## [1] "Add additional genera as 0s"
## # A tibble: 309 x 3
##    genus      count normalized_count
##    <chr>      <dbl>            <dbl>
##  1 abies          4        0.0000522
##  2 abutilon     234        0.00306  
##  3 acacia         0        0        
##  4 acca           0        0        
##  5 acer        4894        0.0639   
##  6 acrocomia      0        0        
##  7 aesculus     835        0.0109   
##  8 afrocarpus     0        0        
##  9 agathis        0        0        
## 10 agonis         0        0        
## # i 299 more rows

## [1] "Get counts of tree genera from OpenTrees"
## [1] "Normalize counts of tree genera from OpenTrees"
## # A tibble: 68 x 3
##    genus       count normalized_count
##    <chr>       <int>            <dbl>
##  1 acer        88739        0.136    
##  2 aesculus     1287        0.00197  
##  3 ailanthus     756        0.00116  
##  4 albizia       163        0.000250 
##  5 alnus          47        0.0000721
##  6 amelanchier  2032        0.00312  
##  7 betula       1400        0.00215  
##  8 carpinus     4042        0.00620  
##  9 carya          99        0.000152 
## 10 castanea      173        0.000265 
## # i 58 more rows
## [1] "Add additional genera as 0s"
## # A tibble: 309 x 3
##    genus      count normalized_count
##    <chr>      <dbl>            <dbl>
##  1 abies          0          0      
##  2 abutilon       0          0      
##  3 acacia         0          0      
##  4 acca           0          0      
##  5 acer       88739          0.136  
##  6 acrocomia      0          0      
##  7 aesculus    1287          0.00197
##  8 afrocarpus     0          0      
##  9 agathis        0          0      
## 10 agonis         0          0      
## # i 299 more rows
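
Padding each dataset to a shared genus list keeps the two count vectors aligned before any distance is computed. A minimal base-R sketch, using a short hypothetical master list in place of the full 309 genera:

```r
# Hypothetical master list of genera (the report uses 309)
all_genera <- c("abies", "acacia", "acer", "aesculus")

# Observed counts for the genera present in one dataset
counts <- data.frame(
  genus = c("acer", "aesculus"),
  count = c(88739, 1287)
)

# Start every genus at zero, then fill in the observed counts by name
padded <- data.frame(genus = all_genera, count = 0)
padded$count[match(counts$genus, padded$genus)] <- counts$count

# Normalize so the proportions sum to 1
padded$normalized_count <- padded$count / sum(padded$count)
padded
```

Because both datasets are padded against the same ordered list, row i refers to the same genus in each, which the element-wise L1 computation requires.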

## [1] "Calculate the L1 Norm (Distance) between Tree Genera from iNat and OpenTrees in NYC"
## [1] "The L1 Distance Between Tree Genera From iNat and OpenTrees in NYC is:  1.50539668794971"
## [1] "Which tree genera are contributing most to the difference in tree genera distributions (high L1 distance)?"
## [1] "Number of genera from iNaturalist not found in OpenTrees: 63"
## [1] "Plot difference in normalized counts per genus and dataset (OpenTrees - iNat)"
## [1] "Right: Genera more common in OpenTrees"
## [1] "Left: Genera more common in iNaturalist"
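
The comparison above can be sketched as follows. The proportions are made-up toy values, and the variable names are hypothetical. Both vectors are assumed to be aligned to the same genus list and to sum to 1.

```r
# Toy normalized proportions for the same four genera in both datasets
genera      <- c("acer", "ailanthus", "betula", "platanus")
p_opentrees <- c(0.45, 0.05, 0.00, 0.50)
p_inat      <- c(0.30, 0.40, 0.20, 0.10)

# Signed difference: positive means more common in OpenTrees
difference <- p_opentrees - p_inat

# L1 distance between the two distributions
l1 <- sum(abs(difference))

# Genera recorded in iNaturalist but absent from OpenTrees
n_missing <- sum(p_inat > 0 & p_opentrees == 0)

# Rank genera by their contribution to the L1 distance
contrib <- data.frame(genus = genera, difference = difference)
contrib <- contrib[order(abs(contrib$difference), decreasing = TRUE), ]
contrib
```

Since each distribution sums to 1, the L1 distance between them lies between 0 (identical) and 2 (no genera in common), which gives a scale for reading the reported value.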

iNat Genera Comparison to OpenTrees

## [1] "Plot difference in normalized counts per genus and dataset (OpenTrees - iNaturalist)"
## [1] "Right: Genera more common in OpenTrees"
## [1] "Left: Genera more common in iNaturalist"

Forest Inventory and Analysis (FIA) Observations

## [1] "There are 140 unique genera in the FIA dataset"
## [1] "There are 23935395 records in the FIA dataset"

FIA Total Genera

## [1] "The top reported tree genera in FIA"
## 
##        pinus      quercus         acer        abies      populus  liquidambar 
##      5403040      3405811      2437490      1171758      1128641       889301 
##        picea     fraxinus  pseudotsuga        nyssa       betula        carya 
##       695528       657422       641185       629892       625576       595846 
##    juniperus        ulmus        tsuga        thuja liriodendron       prunus 
##       580180       507803       485558       438573       403322       342478 
##        fagus       cornus 
##       283248       229597

FIA Count of Genera

## [1] "Abundance of tree genera in North America from FIA"

Compare Trees in FIA to OpenTrees in NYC

## [1] "Do urban tree genera have a different relative abundance than those in natural settings?"
## [1] "Compare OpenTrees and FIA records around NYC"
## [1] "Sampled FIA records within 0.5 decimal degrees"
## [1] "There are 7225 tree genera records from FIA around NYC."
## [1] "There are 652169 tree genera records from OpenTrees in NYC."

## [1] "Get counts of tree genera from FIA"
## [1] "Normalize counts of tree genera from FIA"
## # A tibble: 37 x 3
##    Genus         count normalized_count
##    <chr>         <int>            <dbl>
##  1 acer           1875         0.260   
##  2 ailanthus       123         0.0170  
##  3 amelanchier      36         0.00498 
##  4 betula          723         0.100   
##  5 carpinus         16         0.00221 
##  6 carya           256         0.0354  
##  7 castanea          1         0.000138
##  8 celtis            8         0.00111 
##  9 chamaecyparis     3         0.000415
## 10 cornus           28         0.00388 
## # i 27 more rows
## [1] "Add additional genera as 0s"

## [1] "Get counts of tree genera from OpenTrees"
## [1] "Normalize counts of tree genera from OpenTrees"
## # A tibble: 68 x 3
##    genus       count normalized_count
##    <chr>       <int>            <dbl>
##  1 acer        88739        0.136    
##  2 aesculus     1287        0.00197  
##  3 ailanthus     756        0.00116  
##  4 albizia       163        0.000250 
##  5 alnus          47        0.0000721
##  6 amelanchier  2032        0.00312  
##  7 betula       1400        0.00215  
##  8 carpinus     4042        0.00620  
##  9 carya          99        0.000152 
## 10 castanea      173        0.000265 
## # i 58 more rows
## [1] "Add additional genera as 0s"

## [1] "Calculate the L1 Norm (Distance) between Tree Genera from FIA and OpenTrees in NYC"
## [1] "The L1 Distance Between Tree Genera From FIA and OpenTrees in NYC is:  1.2181468058455"
## [1] "Which tree genera are contributing most to the difference in tree genera distributions (high L1 distance)?"
## [1] "Number of genera from FIA not found in OpenTrees: 0"
## [1] "Plot difference in normalized counts per genus and dataset (OpenTrees - FIA)"
## [1] "Right: Genera more common in OpenTrees"
## [1] "Left: Genera more common in FIA"

FIA Genera Comparison to OpenTrees

## [1] "Plot difference in normalized counts per genus and dataset (OpenTrees - FIA)"
## [1] "Right: Genera more common in OpenTrees"
## [1] "Left: Genera more common in FIA"

FIA Urban Dataset Observations

## [1] "There are 135 unique genera in the FIA Urban dataset"
## [1] "There are  36979  records in the FIA Urban dataset"

Urban FIA Total Genera

## [1] "The top reported tree genera in Urban FIA"
## 
##      acer   quercus juniperus     ulmus  fraxinus     pinus    celtis    prunus 
##      4718      4487      4210      2591      2071      1532      1514       964 
##     morus   rhamnus     carya   populus  triadica      ilex     thuja   juglans 
##       819       762       750       704       637       589       541       512 
##  prosopis     tilia     salix   robinia 
##       489       469       458       445

Urban FIA Genus Counts Per State

## [1] "Abundance of tree genera per state in Urban FIA"